A Classical Chinese Corpus with Nested Part-of-Speech Tags
Abstract
We introduce a corpus of classical Chinese poems that has been word-segmented and tagged with parts of speech (POS). Because the concept of a ‘word’ is ill-defined in Chinese, previous Chinese corpora lack a standard for word segmentation. This leads to inconsistent POS tags and hinders interoperability among corpora. We address this problem with nested POS tags, which accommodate different theories of wordhood and support research objectives that require annotations of the ‘word’ at different levels of granularity.
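The idea of nested tags can be illustrated with a minimal sketch. It is not the corpus's actual annotation format: the `Node` class, the `segment_line` helper, the tag labels (`NR`, `JJ`, `NN`, `VV`) and the analysis of the example line are all assumptions made for illustration. An outer tag covers a compound as a single word, and inner tags cover its parts, so the same annotation can be read at a coarse or a fine granularity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A tagged unit: either a leaf with text, or a tagged group of children."""
    tag: str                                   # hypothetical POS label
    text: str = ""                             # set on leaves only
    children: List["Node"] = field(default_factory=list)

    def surface(self) -> str:
        """Concatenate the characters covered by this node."""
        if not self.children:
            return self.text
        return "".join(child.surface() for child in self.children)


def segment(node: Node, depth: int) -> List[Tuple[str, str]]:
    """Return (word, tag) pairs, descending at most `depth` nesting levels."""
    if not node.children or depth == 0:
        return [(node.surface(), node.tag)]
    pairs: List[Tuple[str, str]] = []
    for child in node.children:
        pairs.extend(segment(child, depth - 1))
    return pairs


def segment_line(line: List[Node], depth: int) -> List[Tuple[str, str]]:
    """Segment a whole line of verse at the requested granularity."""
    pairs: List[Tuple[str, str]] = []
    for node in line:
        pairs.extend(segment(node, depth))
    return pairs


# Illustrative analysis (an assumption, not taken from the corpus):
# the compound 黃河 carries an outer tag and two inner tags.
line = [
    Node("NR", children=[Node("JJ", "黃"), Node("NN", "河")]),
    Node("VV", "入"),
    Node("NN", "海"),
    Node("VV", "流"),
]

coarse = segment_line(line, depth=0)  # compound kept as one word
fine = segment_line(line, depth=1)    # compound split into its parts
```

Under this sketch, a study that treats 黃河 as one word reads the line at `depth=0`, while one that treats each character as a word reads it at `depth=1`, and both work from the same annotation.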
Similar Resources
A'lam Corpus: A Standard Named Entity Corpus for the Persian Language
Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used in text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the boundaries of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...
SOAT: A Semi-Automatic Domain Ontology Acquisition Tool from Chinese Corpus
In this paper, we focus on domain ontology acquisition from a Chinese corpus by means of extraction rules designed for Chinese phrases. These rules are noun sequences with part-of-speech tags. Experiments show that this process can construct domain ontology prototypes efficiently and effectively.
A Two-Stage Approach to Chinese Part-of-Speech Tagging
This paper describes a Chinese part-of-speech tagging system based on the maximum entropy model. It presents a novel two-stage approach that uses the part-of-speech tags of the words on both sides of the current word in Chinese part-of-speech tagging. The system is evaluated on four corpora at the Fourth SIGHAN Bakeoff in the closed track of the Chinese part-of-speech tagging task.
The Construction of a Segmented and Part-of-speech Tagged Archaic Chinese Corpus: A Case Study on Huainanzi
In this paper, we present a segmented and part-of-speech (POS) tagged Archaic Chinese corpus along with its construction process, which consists of automatic segmentation and tagging followed by manual correction as post-processing. We use both Modern and Archaic Chinese labeled data to train the word segmenter and POS tagger, which are further improved by domain adaptation techniques, as well as ...
Combining Context Features by Canonical Belief Network for Chinese Part-Of-Speech Tagging
Part-of-speech (POS) tagging is an essential basis of natural language processing (NLP). In this paper, we present an algorithm that uses a Canonical Belief Network to combine a variety of context features, e.g. the POS tags of the words next to the word a that needs to be tagged and the lexical context of a, in order to determine the POS tag of a. Experiments on a Chinese corpus are conduc...